Document relevancy analysis within machine learning systems

ABSTRACT

Systems and methods that quantify document relevance for a document relative to a training corpus and select a best match or best matches are provided herein. Methods may include generating an example-based explanation for relevancy of a document to a training corpus by executing a support vector machine classifier, the support vector machine classifier performing a centroid classification of a relevant document in a term frequency-inverse document frequency feature space relative to training examples in a training corpus, and generating an example-based explanation by selecting a best match for the relevant document from the training examples based upon the centroid classification.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/952,501, filed Jul. 26, 2013, which is a continuation of U.S. patent application Ser. No. 13/632,943, filed Oct. 1, 2012 and issued Sep. 10, 2013 as U.S. Pat. No. 8,533,148, all of which are hereby incorporated by reference in their entirety, including all references and appendices cited therein.

FIELD OF THE TECHNOLOGY

Embodiments of the disclosure relate to machine learning systems that quantify the relevancy of a document relative to a training corpus by providing simple, intuitive and valid explanations for why a document is relevant to a training corpus. Additionally, the present technology may provide best-matches of training documents included in the training corpus relative to a document selected by a classifier as relevant (e.g., training documents that are semantically close to the selected document).

BACKGROUND OF THE DISCLOSURE

Machine learning systems may utilize highly complex analysis algorithms to generate statistically valid document recommendations relative to training documents of a training corpus. While these recommendations are statistically valid, end users may prefer example-based explanations when trying to understand why a certain document was recommended by the machine learning system.

Some machine learning systems attempt to avoid non-intuitive algorithms altogether when human-understandability of results is of particular importance. This approach may be acceptable in domains where classification accuracy is not of highest importance and, typically, where the number of dimensions that are used to describe the problem is low. Unfortunately, a low dimensional space rarely, if ever, occurs in document classification processes.

Another approach is to use feature reduction algorithms to either reduce the input space or the complexity of the solution. Further approaches may involve interactive visualization methods that put the burden of finding the most suitable explanation (e.g., most relevant match) on the user. Additionally, these interactive visualization methods often require significant computing resources in addition to excessive or undesirable end user effort.

The success of feature reduction algorithms is highly domain dependent. They are most suitable in domains where input dimensions differ in quality, meaning, and/or where dimensions are redundant. Irrelevant and redundant dimensions (e.g., words) are typically filtered out via stop word lists or combined by phrase detection algorithms in document classification domains so that a further reduction of dimensions often leads to a significant deterioration in classification accuracy.

End users are typically not interested in algorithmic explanations, whether those explanations are simple or complex. The end user desires to inspect specific training examples that are most likely the cause for the given classification of a new document.

SUMMARY OF THE DISCLOSURE

According to some embodiments, the present technology may be directed to methods for quantifying relevancy of a document to a training corpus by: (a) calculating an internal best match score for a relevant document relative to each training example in the training corpus by: (i) determining cosine distances between the relevant document and training examples in the training corpus relative to term frequency-inverse document frequency weights associated with the training examples; (b) determining the training example having a closest cosine distance to the relevant document; and (c) outputting the training example having the closest cosine distance to the relevant document.

According to other embodiments, the present technology may be directed to machine learning systems that quantify relevancy of a document to a training corpus. The machine learning systems may comprise: (a) at least one server comprising a processor configured to execute instructions that reside in memory, the instructions comprising: (i) a classifier module that: (1) calculates an internal best match score for a relevant document relative to each training example in the training corpus by determining cosine distances between the document and training examples in the training corpus relative to term frequency-inverse document frequency weights associated with the training examples; and (2) determines the training example having the closest cosine distance to the relevant document; and (ii) a user interface module that outputs the training example having the closest cosine distance to the relevant document.

According to additional embodiments, the present technology may be directed to methods for generating an example-based explanation for relevancy of a document to a training corpus. These methods may comprise: (a) executing a support vector machine classifier that (i) generates a classification model using the training corpus; and (ii) classifies subject documents using the classification model; (b) creating a centroid classification for a selected relevant document in a term frequency-inverse document frequency feature space; and (c) generating an example-based explanation by selecting a best match for the selected relevant document from the training examples from the training corpus.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate embodiments of concepts that include the claimed disclosure, and explain various principles and advantages of those embodiments.

The methods and systems disclosed herein have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.

FIG. 1 illustrates an exemplary system for practicing aspects of the present technology;

FIG. 2 shows a schematic diagram of an exemplary document relevancy application;

FIG. 3 is an exemplary graphical user interface that comprises a list of documents which have been marked as relevant by the user;

FIG. 4 is an exemplary graphical user interface that allows the end user to select how the classification module trains various aspects of the category/domain;

FIG. 5 is an exemplary graphical user interface that illustrates a Best Matches View;

FIG. 6 is an exemplary graphical user interface for Best Matches that is overlaid upon the interface of FIG. 5;

FIG. 7 is an exemplary graphical user interface having ten best-matches for the relevant document;

FIG. 8 is a flowchart of an exemplary method for quantifying relevancy of a document to a training corpus;

FIG. 9 is a flowchart of an exemplary method for generating an example-based explanation for relevancy of a document to a training corpus; and

FIG. 10 illustrates an exemplary computing system that may be used to implement embodiments according to the present technology.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. It will be apparent, however, to one skilled in the art, that the disclosure may be practiced without these specific details. In other instances, structures and devices are shown in block diagram form only in order to avoid obscuring the disclosure.

Generally speaking, the present technology is directed to systems and methods that perform document relevancy analyses within the context of machine learning systems.

Systems and methods provided herein may be used to quantify the relevancy of documents in various fields. In some instances, the present technology may be utilized to explain suggestions (e.g., suggested/relevant documents) generated by various machine learning technologies, such as support vector machines (“SVM”).

The present technology may be trained based on manually provided training data. The present technology may suggest further “suggested documents” (e.g., relevant documents) to an end user. When the end user selects one of the suggested documents, the present technology may then generate a ranked sub-list of the original training data examples. The top-most entries explain best why the single selected suggested document has been suggested.

The present technology may combine a statistical learning algorithm (e.g., a support vector machine), which is used to train a category or domain from training examples, with an established similarity measure for comparing relevant documents to the training examples.

Broadly speaking, the present technology may use the training results of a support vector machine (“SVM”) to suggest documents (e.g., relevant documents) to the end user. As used herein, a “suggested” document indicates that the SVM has determined that the end user should include the document in a particular category or classification. To this end, the SVM may analyze documents that the user has decided are relevant, for example, those documents that the end user has determined to belong to a specific category.

In some instances, the end user wishes to understand why the SVM selected such documents as “relevant” in order to learn more about the data. However, the end user may not want to see complicated graphics. Furthermore, the end user may not want complicated explanations or insight into the mathematical details of the learning procedure. The present technology provides the end user with an example-based explanation of the form “the document has been suggested because it matches the designated input documents very closely.”

An example-based explanation may be a challenge for the SVMs to generate because the involved algorithms operate on the whole set of input documents at once, often using global optimization schemes. Afterwards, SVMs provide insight about relevant words inside of documents or artificial “documents” which contain these words (the “support vectors”). The end-user, however, may desire to determine the best match or a ranked list of matches for the designated input documents. For example, the end user desires to inspect examples, and particularly the most relevant examples, selected from the training examples used to train the SVMs.

In furtherance thereof, the present technology may determine the cosine distance between the relevant document and the training examples relative to a tf-idf weight vector space. Generally described, the tf-idf may comprise a weight (term frequency-inverse document frequency), which is a numerical statistic that reflects how important a word is to a document in a collection or training corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus training examples, which helps moderate the fact that some words are generally more common than others.

During calculation of cosine distances, the present technology may assign particular weights to the words of each document. In some instances the weights depend on the complete document corpus of training examples.

In operation, the end user picks one “relevant” document for which the end user seeks an explanation. The end user wants to know why the SVM deems this document as “relevant.” The present technology then locates training examples which are semantically close to that selected and relevant document. In some instances the search may be restricted to the set of training examples that the user marked as “these belong to my category” and which have been used by the SVM in order to learn the category/domain. Document “closeness” may be defined by means of the tf-idf similarity measure. In fact, the search for “best matches” with respect to the selected “relevant” document among the set of training examples is conceptually equivalent to the training of a centroid classifier which is centered on the single selected “relevant” document.

The present technology provides a fast and simple way to trigger the aforementioned computations and to inspect the results. Furthermore, the present technology unifies results such that not only explanations for a single suggested document can be retrieved, but also results for any suggested documents for which the end user seeks explanations regarding why the SVM suggested the documents.

FIG. 1 illustrates an exemplary system for practicing aspects of the present technology. The system may include a machine learning system 105 that may include one or more web servers, along with digital storage media devices such as databases. The machine learning system 105 may also function as a cloud-based computing environment that is configured to process electronic documents in accordance with various embodiments of the present technology. Details regarding the operation of machine learning system 105 will be discussed in greater detail with regard to FIG. 2.

In general, a cloud-based computing environment is a resource that typically combines the computational power of a large grouping of processors and/or that combines the storage capacity of a large grouping of computer memories or storage devices. For example, systems that provide a cloud resource may be utilized exclusively by their owners, such as Google™ or Yahoo!™; or such systems may be accessible to outside users who deploy applications within the computing infrastructure to obtain the benefit of large computational or storage resources.

The cloud may be formed, for example, by a network of web servers, with each web server (or at least a plurality thereof) providing processor and/or storage resources. These servers may manage workloads provided by multiple users (e.g., cloud resource customers or other users). Typically, each user places workload demands upon the cloud that vary in real-time, sometimes dramatically. The nature and extent of these variations typically depend on the type of business associated with the user.

A plurality of client devices 110a-n may communicatively couple with the machine learning system 105 via a network connection 115. The network connection 115 may include any one of a number of private and public communications mediums such as the Internet. The client devices 110a-n may be required to be authenticated with the machine learning system 105 via credentials such as a username/password combination, or any other authentication means that would be known to one of ordinary skill in the art with the present disclosure before them.

FIG. 2 illustrates a block diagram of an exemplary document relevancy application, hereinafter application 200, which is constructed in accordance with the present disclosure. The application 200 may reside within memory of the machine learning system 105. According to some embodiments, execution of the application 200 by a processor of the machine learning system 105 may cause the machine learning system 105 to quantify relevancy of a document to a training corpus by first calculating an internal best match score for a relevant document relative to each training example in the training corpus. The internal best match score may be calculated by determining cosine distances between the relevant document and training examples in the training corpus relative to term frequency-inverse document frequency weights associated with the training examples. Additionally, the machine learning system 105 may determine the training example having the closest cosine distance to the relevant document, as well as output the training example having the closest cosine distance to the relevant document. As mentioned above, the training example having the closest cosine distance to the relevant document may also be referred to as the “best match” for the relevant document.

The application 200 may comprise a plurality of modules such as a user interface module 205, a document transformation module 210, and a classification module 215. It is noteworthy that the application 200 may include additional modules, engines, or components, and still fall within the scope of the present technology. As used herein, the term “module” may also refer to any of an application-specific integrated circuit (“ASIC”), an electronic circuit, a processor (shared, dedicated, or group) that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality. In other embodiments, individual modules of the application 200 may include or be executed on separately configured web servers.

The client devices 110a-n may interact with the application 200 via one or more graphical user interfaces that are generated by the user interface module 205. Additionally, example-based explanations of document relevancy may be provided to the client devices via one or more graphical user interfaces. Various graphical user interfaces generated by the user interface module 205 are illustrated in FIGS. 3-7, which will be described in greater detail below.

Prior to providing example-based explanations of document relevancy, the document transformation module 210 may be executed to transform each relevant document and/or the training examples to a high-dimensional feature space using term (e.g., word) frequencies. The document transformation module 210 may employ the definition

$x^{(j)} = \mathrm{tf}(x, j) \geq 0$

that determines a relative number of occurrences of term j in document x. It will be understood that a “term” may comprise an original word as it occurred in an input document, a stemmed word, or a phrase. Stop words may be excluded by the document transformation module 210. A stemmed word may result from standard stemming procedures that would be known to one of ordinary skill in the art with the present disclosure before them. A phrase is a combination of two or more words, which may be computed by one or more statistical methods that would also be known to one of ordinary skill in the art.
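By way of non-limiting illustration, the term-frequency transformation described above may be sketched in Python as follows. The sketch assumes that the “relative number of occurrences” is the count of a term divided by the total number of retained terms in the document; the function name term_frequencies, the tokenization rule, and the short stop-word list are hypothetical and are not taken from the disclosure. Stemming and phrase detection are omitted.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Return {term: relative frequency} for a single document."""
    # Assumed tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    # Exclude empty tokens and stop words.
    terms = [t for t in tokens if t and t not in stop_words]
    counts = Counter(terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Every stored value is > 0; terms absent from the document have tf = 0.
    return {term: n / total for term, n in counts.items()}


tf = term_frequencies("The bank of West Germany raised the rates, and the bank held.")
```

In this sketch each document becomes a sparse mapping from terms to non-negative values, which corresponds to one point in the high-dimensional feature space.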

Additionally, prior to providing example-based explanations of document relevancy, the classification module 215 may be trained on a training corpus that comprises training examples. These training examples may be selected by the end user. The classification module 215 may classify a set of documents and determine relevant documents in the set using an SVM model, such as a hyperplane.

More specifically, the classification module 215 may be executed to calculate internal best match scores for a relevant document d for each training example x in a training corpus. In some instances, the classification module 215 may comprise a linear SVM. It can be assumed that the classification module 215 has been trained on a non-empty training data set defined by

$T = \{x_1, \ldots, x_N\} \subset D \subset \mathbb{R}^m$

where D is the document universe containing all documents and m is the dimension of the feature space. A document d ∈ D∖T has been suggested by the classification module 215 using an exemplary scoring mechanism. By way of non-limiting example, the classification module 215 may determine a distance for the document d relative to an SVM model hyperplane.

In order to establish a connection between the document d and one of the training examples x, the classification module 215 may define internal best match scores with respect to d for each training document x_i, i=1, . . . , N, as follows: let

$\tilde{x} := \left( \sqrt{x^{(j)}} \cdot \mathrm{idf}_j(D) \right)_{j=1,\ldots,m}$

be the tf-idf weight of

$x = (x^{(1)}, \ldots, x^{(m)}) \in D.$

Additionally, the classification module 215 may utilize the following equation

$\mathrm{idf}_j(D) = \log \frac{|D|}{\left| \{ x \in D \mid x^{(j)} \neq 0 \} \right|}$

to calculate the inverse document frequency for the training examples.

The classification module 215 may then apply a square-root transformation to term-frequencies to dampen the internal best match scores such that the internal best match scores rise linearly with the number of overlapping term counts, rather than quadratically, as for raw frequency counts.
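A minimal Python sketch of the tf-idf weighting with the square-root transformation is given below. It assumes the standard reading of the inverse document frequency, namely the logarithm of the number of documents divided by the number of documents that contain the term. The function names and the three-document toy corpus are hypothetical and serve only as illustration.

```python
import math


def inverse_document_frequencies(corpus):
    """corpus: list of {term: tf} mappings. Returns {term: idf}."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term, tf in doc.items():
            if tf > 0:  # count documents in which the term occurs
                doc_freq[term] = doc_freq.get(term, 0) + 1
    # A term occurring in every document receives idf = log(1) = 0.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_weights(doc, idf):
    """Square-root-damped tf-idf weights for one document.

    Assumption: terms without an idf value (unseen in the corpus) are dropped.
    """
    return {t: math.sqrt(tf) * idf[t] for t, tf in doc.items() if t in idf}


# Hypothetical toy corpus of term-frequency mappings.
corpus = [
    {"bank": 0.5, "rates": 0.5},
    {"bank": 0.25, "germany": 0.75},
    {"trade": 1.0},
]
idf = inverse_document_frequencies(corpus)
weights = tfidf_weights(corpus[1], idf)
```

Because each weight uses the square root of the term frequency, a term shared by two documents with equal frequency f contributes f times the squared idf to their scalar product, rather than f squared times the squared idf.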

Next, the classification module 215 may utilize

$B_i := B(x_i, d) := \frac{\tilde{x}_i \cdot \tilde{d}}{\|\tilde{x}_i\| \, \|\tilde{d}\|}$

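to calculate an internal best matches score for training document i with respect to the relevant document d. The scalar product divided by the vector lengths is the cosine of the angle between the tf-idf weights of x_i and d, which is referred to herein as the cosine distance between x_i and d in the tf-idf feature space. The training example with the largest score is therefore the example referred to herein as having the closest cosine distance.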
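The internal best matches score may be sketched in Python as follows, assuming that the tf-idf weights of each document are held in a sparse mapping from terms to weights. The function name internal_best_match_score, the handling of all-zero vectors, and the sample weights are assumptions made for illustration.

```python
import math


def internal_best_match_score(x_weights, d_weights):
    """Scalar product of two sparse weight vectors divided by their lengths."""
    # Only terms present in both documents contribute to the scalar product.
    dot = sum(w * d_weights[t] for t, w in x_weights.items() if t in d_weights)
    norm_x = math.sqrt(sum(w * w for w in x_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in d_weights.values()))
    if norm_x == 0.0 or norm_d == 0.0:
        # Assumption: a document with no weighted terms scores zero.
        return 0.0
    return dot / (norm_x * norm_d)


# Hypothetical tf-idf weights for a relevant document d and three examples.
d = {"bank": 1.0, "germany": 2.0}
x1 = {"bank": 2.0, "germany": 4.0}   # same direction as d
x2 = {"trade": 3.0}                  # no terms in common with d
x3 = {"bank": 1.0}                   # partial overlap with d
scores = [internal_best_match_score(x, d) for x in (x1, x2, x3)]
```

Since the weights are non-negative, every score lies between 0 and 1, with larger values indicating training examples that are closer to the relevant document.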

According to some embodiments, the classification module 215 may then stretch the internal best matches scores linearly to cover the complete unit interval in order to provide a suitable ranking (e.g., explanation) of the relevant document for the end user. The classification module 215 may utilize

$b_i := b(x_i, d) := \frac{B_i - \min(B_j)}{\max(B_j) - \min(B_j)}$

to calculate final best matches scores with respect to the relevant document d, wherein min and max are computed over j=1, . . . , N. Thus, the training document x_i whose score b_i equals 100% is the closest training document and may be used as the best explanation for why the document d has been suggested by the classification module 215. It is noteworthy that this approach is actually a training procedure for a centroid classifier trained from the single relevant document d where the centroid is d. The tf-idf scoring mechanism utilized by the classification module 215 allows a concise ranking of results.
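The linear stretching and the resulting ranking may be sketched in Python as follows. The function name stretch_scores, the treatment of the case in which all internal scores are equal (which the formula leaves undefined), and the sample scores are assumptions made for illustration; the list of internal scores is assumed to be non-empty.

```python
def stretch_scores(internal_scores):
    """Map internal scores linearly onto the unit interval [0, 1]."""
    lo, hi = min(internal_scores), max(internal_scores)
    if hi == lo:
        # Assumption: if all scores tie, every example is an equally good match.
        return [1.0 for _ in internal_scores]
    return [(s - lo) / (hi - lo) for s in internal_scores]


# Hypothetical internal best matches scores for three training examples.
internal = [0.30, 0.80, 0.55]
final = stretch_scores(internal)

# Indices of the training examples, best match first.
ranking = sorted(range(len(final)), key=lambda i: final[i], reverse=True)
```

After stretching, the best match always receives a final score of 1.0 (100%) and the weakest training example receives 0.0, regardless of the absolute values of the internal scores.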

FIG. 3 is an exemplary graphical user interface 300 that comprises a list 305 of manually categorized documents: the end user decided that they belong to the category “Relevant.” The interface 300 displays the result of this decision. In this context, column 310 indicates a relevancy of 100% because all manually tagged documents have been requested. The sub-pane 315 comprises descriptive data regarding a selected item of list 305. As a next step, the user may start machine learning algorithms to let the computer suggest further relevant documents.

FIG. 4 illustrates an exemplary graphical user interface 400 that allows the end user to select how the classification module trains various aspects of the category/domain. The interface 400 allows the end user to define a set of documents in his category as “Relevant” by selecting the Relevant check box 405. In an exemplary use case it will be assumed that the classification module locates eighty seven documents which have been assigned by the classification module to the “Relevant” category.

FIG. 5 is an exemplary graphical user interface 500 that illustrates a Best Matches View. The “Relevant” category belongs to the “Topics” taxonomy. It has been configured as a “Best matches” taxonomy, meaning that best matches will be computed. After the training of the classification module has finished, the end user sees all documents that the classification module determines the end user should consider as part of the category “Relevant.” The screen switches to yellow to emphasize the fact that only computer-suggested documents are displayed.

The end user typically wishes to inspect the training example which explains best why the relevant document has been suggested (e.g., he wants to see the best-match for a single selected document). If available, the end user may also locate further training documents which are almost as relevant. The end user may select a threshold relevancy value that allows the present technology to determine the number of documents that are close to the relevant document and provide a best-match explanation.

Threshold normalization simplifies the user experience when the end user retrieves best-matches for a different suggested document. For example, the end user may select 95% as the relevancy threshold and always gets the best-match (which has 100%) and perhaps some which are close to the very best match. The threshold can be selected using the slider below the document list 500 (also referred to as interface 500 or graphical user interface 500): the value shown in FIG. 5 is 85%, meaning that the number of best-matches with a best-matches rank of at least 85% will be shown in column 515.

Note that the document display contains highlighted terms 505 and 510. These terms constitute the outcome of the support vector machine categorization. That is, the highlighted terms 505 and 510 provide one way to analyze what has been trained. For example, “west-germany” is among the important concepts of the “Relevant” category. The interface 500 comprises a column 515 entitled “Matches”. This column 515 shows the number of training documents for which the best-matches score is at least 85%. This threshold has been selected by means of a slider mechanism, although other mechanisms for selecting a best-matches threshold may also be utilized. Clicking on the number “3” in column 515 “Matches” opens the best-matches user interface shown in FIG. 6.
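The number displayed in column 515 may be sketched in Python as a count of final best matches scores that meet the selected threshold. The sketch assumes an inclusive comparison (a score of at least the threshold); the function name count_best_matches is hypothetical, and the sample scores are hypothetical apart from the two 100% entries and the 86% entry, which mirror the example of FIG. 6.

```python
def count_best_matches(final_scores, threshold=0.85):
    """Count training examples whose final score is at least the threshold."""
    return sum(1 for s in final_scores if s >= threshold)


# Hypothetical final best matches scores for one suggested document.
final_scores = [1.0, 1.0, 0.86, 0.40, 0.0]

matches_at_85 = count_best_matches(final_scores, threshold=0.85)
matches_at_95 = count_best_matches(final_scores, threshold=0.95)
```

Because the final scores are stretched to the unit interval for every suggested document, the same threshold always retains at least the best match.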

FIG. 6 is an exemplary graphical user interface 600 for Best Matches that is overlaid upon the interface 500 of FIG. 5. Here, we see the three matches 605, 610, and 615, two of which are 100% relevant (e.g., 605 and 610). These are the two best matches as defined by the b_i numbers calculated by the classification module as described above. The third entry has a similarity of 86%. Each of these documents has a state of “Manually coded” 620. The manually coded indicator informs the end user that the document belongs to the original training data set. Only those original documents are considered here. The document display shows that the selected document ‘reuters14829 . . . ’ is associated with west-germany.

FIG. 7 is an exemplary graphical user interface 700 having ten best-matches for the document, including a document with a “Pos.” of 3. The interface 700 shows the ten closest best-matches (e.g., closest cosine distances), two of which, documents 705 and 710, have a best-matches similarity of 100%. The 100% refers to the best match with respect to the selected suggested document. The stretching means may be similar for every document, even though particular documents may have more than one document which has the same similarity.

FIG. 8 is a flowchart of an exemplary method 800 for quantifying the relevancy of a document to a training corpus. The method 800 may comprise a step 805 of training a classifier on a training corpus that comprises training examples. Additionally, the method may comprise a step 810 of classifying a set of documents, as well as a step 815 of determining relevant documents in the set.

To provide an example-based explanation for why a relevant document is relevant to one or more training examples in the training corpus, the method may comprise a step 820 of calculating an internal best match score for a relevant document relative to each training example in the training corpus. Calculating an internal best match score may comprise a step 825 of determining cosine distances between the relevant document and training examples in the training corpus relative to term frequency-inverse document frequency weights associated with the training examples. Next, the method 800 may comprise a step 830 of determining the training example having the closest cosine distance to the relevant document, as well as a step 835 of outputting the training example having the closest cosine distance to the relevant document.

It is noteworthy that in some instances, steps 805-815 may be executed separately from steps 820-835. That is, the classification of documents may be performed prior to the calculation of best matches for a relevant document.

FIG. 9 is a flowchart of an exemplary method 900 of generating an example-based explanation for relevancy of a document to a training corpus. Generally, the method 900 may comprise a step 905 of executing a support vector machine classifier. Execution of the SVM may comprise various steps such as a step 910 of generating a classification model using the training corpus. Next, the method may comprise a step 915 of classifying subject documents using the classification model. After classification, the method may comprise a step 920 of performing a centroid classification of a selected relevant document in a term frequency-inverse document frequency feature space, as well as a step 925 of generating an example-based explanation by selecting a best match for the selected relevant document from the training examples of the training corpus based upon the centroid classification.

The computing system 1000 of FIG. 10 may be implemented in the contexts of the likes of computing systems, networks, servers, or combinations thereof. The computing system 1000 of FIG. 10 includes one or more processors 1100 (also referred to as processor unit 1100) and main memory 1200. Main memory 1200 stores, in part, instructions and data for execution by processor 1100. Main memory 1200 may store the executable code when in operation. The system 1000 of FIG. 10 further includes a mass storage device 1300, portable storage medium drive(s) 1400 (also referred to as portable storage device 1400), output devices 1500, user input devices 1600, a graphics display 1700 (also referred to as display system 1700), and peripheral devices 1800.

The components shown in FIG. 10 are depicted as being connected via a single bus 1900. The components may be connected through one or more data transport means. Processor unit 1100 and main memory 1200 may be connected via a local microprocessor bus, and the mass storage device 1300, peripheral device(s) 1800, portable storage device 1400, and graphics display 1700 may be connected via one or more input/output (I/O) buses.

Mass storage device 1300, which may be implemented with a magnetic disk drive or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by processor unit 1100. Mass storage device 1300 may store the system software for implementing embodiments of the present technology for purposes of loading that software into main memory 1200.

Portable storage device 1400 operates in conjunction with a portable non-volatile storage medium, such as a floppy disk, compact disk, digital video disc, or USB storage device, to input and output data and code to and from the computing system 1000 of FIG. 10. The system software for implementing embodiments of the present technology may be stored on such a portable medium and input to the computing system 1000 via the portable storage device 1400.

Input devices 1600 provide a portion of a user interface. Input devices1600 may include an alphanumeric keypad, such as a keyboard, forinputting alpha-numeric and other information, or a pointing device,such as a mouse, a trackball, stylus, or cursor direction keys.Additionally, the system 1000 as shown in FIG. 10 includes outputdevices 1500. Suitable output devices include speakers, printers,network interfaces, and monitors.

Graphics display 1700 may include a liquid crystal display (LCD) or other suitable display device. Graphics display 1700 receives textual and graphical information, and processes the information for output to the display device.

Peripherals 1800 (also referred to as peripheral devices 1800) may include any type of computer support device to add additional functionality to the computing system. Peripheral device(s) 1800 may include a modem or a router.

The components provided in the computing system 1000 of FIG. 10 are those typically found in computing systems that may be suitable for use with embodiments of the present technology and are intended to represent a broad category of such computer components that are well known in the art. Thus, the computing system 1000 of FIG. 10 may be a personal computer, hand held computing system, telephone, mobile computing system, workstation, server, minicomputer, mainframe computer, or any other computing system. The computer may also include different bus configurations, networked platforms, multi-processor platforms, etc. Various operating systems may be used including Unix, Linux, Windows, Macintosh OS, Palm OS, Android, iPhone OS and other suitable operating systems.

It is noteworthy that any hardware platform suitable for performing the processing described herein is suitable for use with the technology. Computer-readable storage media refer to any medium or media that participate in providing instructions to a central processing unit (CPU), a processor, a microcontroller, or the like. Such media may take forms including, but not limited to, non-volatile and volatile media such as optical or magnetic disks and dynamic memory, respectively. Common forms of computer-readable storage media include a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic storage medium, a CD-ROM disk, digital video disk (DVD), any other optical storage medium, RAM, PROM, EPROM, a FLASHEPROM, any other memory chip or cartridge.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. The descriptions are not intended to limit the scope of the technology to the particular forms set forth herein. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments. It should be understood that the above description is illustrative and not restrictive. To the contrary, the present descriptions are intended to cover such alternatives, modifications, and equivalents as may be included within the spirit and scope of the technology as defined by the appended claims and otherwise appreciated by one of ordinary skill in the art. The scope of the technology should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.

What is claimed is:
 1. A method for quantifying relevancy of a document to a training corpus, the method comprising: classifying a first training corpus using a classifier to calculate an internal best match score for a relevant document relative to each training example in the first training corpus, the relevant document comprising a member of a document universe, the classification including: calculating cosine distances between the relevant document and training examples in the first training corpus relative to term frequency-inverse document frequency weights associated with the training examples of the first training corpus; and determining the training example in the first training corpus having a closest cosine distance to the relevant document; and providing to an end user the training example from the first training corpus having the closest cosine distance to the relevant document.
 2. The method according to claim 1, further comprising converting each of the training examples into a high-dimensional feature space using term frequencies.
 3. The method according to claim 1, further comprising calculating an internal best match score for each of the training examples by multiplying a square root of term frequencies by an inverse document frequency.
 4. The method according to claim 1, further comprising: training the classifier on a training corpus that comprises training examples; classifying a set of documents; and determining relevant documents in the set.
 5. The method according to claim 4, wherein the classifier comprises a support vector machine.
 6. The method according to claim 4, wherein determining relevant documents in the set comprises determining distances between each document within the set of documents relative to a support vector machine model.
 7. The method according to claim 1, further comprising outputting a list of the training examples based upon ranked cosine distances between the relevant document and the training examples.
 8. The method according to claim 7, further comprising applying a relevancy threshold to affect an amount of training examples that are included in the list.
 9. The method according to claim 1, wherein determining the training example having the closest cosine distance to the relevant document further comprises ranking the training examples by stretching the internal best match scores for the training examples linearly to cover a complete unit interval.
 10. A machine learning system that quantifies relevancy of a document to a training corpus, the system comprising: at least one server comprising a processor configured to execute instructions that reside in memory, the instructions comprising: a classifier module that: calculates an internal best match score for a relevant document relative to each training example in the first training corpus, the relevant document being a member of a document universe, the classification including: calculating cosine distances between the relevant document and training examples in the first training corpus relative to term frequency-inverse document frequency weights associated with the training examples of the first training corpus; and determining the training example in the first training corpus having a closest cosine distance to the relevant document; and a user interface module that provides to an end user the training example from the first training corpus having the closest cosine distance to the relevant document.
 11. The machine learning system according to claim 10, wherein each of the training examples has been converted into a high-dimensional feature space using term frequencies.
 12. The machine learning system according to claim 10, wherein the classifier module calculates an internal best match score for each of the training examples by multiplying a square root of term frequencies by an inverse document frequency.
 13. The machine learning system according to claim 10, wherein the classifier module further: classifies a set of documents; and determines relevant documents in the set, the classifier module being trained on a training corpus that comprises training examples.
 14. The machine learning system according to claim 13, wherein the classifier module comprises a support vector machine.
 15. The machine learning system according to claim 13, wherein the classifier module determines relevant documents in the set by determining cosine distances between each document within the set of documents using a support vector machine model.
 16. The machine learning system according to claim 10, wherein the user interface module further outputs a list of the training examples based upon ranked cosine distances between the relevant document and the training examples.
 17. The machine learning system according to claim 16, wherein the classifier module further applies a relevancy threshold to affect an amount of training examples that are included in the list.
 18. The machine learning system according to claim 10, wherein the classifier module determines the training example having the closest cosine distance to the relevant document by ranking the training examples by stretching the internal best match scores for the training examples linearly to cover a complete unit interval.
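
By way of illustration only, and not limitation, the best match scoring recited in the claims may be sketched as follows. The sketch is a non-authoritative example under stated assumptions: the function names (`idf`, `weight_vector`, `cosine_similarity`, `rank_training_examples`), the whitespace-tokenized input, and the smoothed form of the inverse document frequency are all hypothetical and are not taken from the disclosure. The "closest cosine distance" is treated here as the highest cosine similarity.

```python
import math
from collections import Counter


def idf(term, doc_freq, n_docs):
    # ASSUMPTION: a smoothed IDF form; the claims do not fix a formula.
    return 1.0 + math.log(n_docs / (1.0 + doc_freq.get(term, 0)))


def weight_vector(tokens, doc_freq, n_docs):
    # Weight = square root of term frequency times inverse document frequency.
    counts = Counter(tokens)
    return {t: math.sqrt(c) * idf(t, doc_freq, n_docs) for t, c in counts.items()}


def cosine_similarity(a, b):
    # Sparse dot product over shared terms, divided by the two vector norms.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_training_examples(relevant_doc, training_docs, threshold=0.0):
    """Return (training index, stretched score) pairs, best match first."""
    if not training_docs:
        return []
    n_docs = len(training_docs)
    # Document frequencies are taken from the training corpus only.
    doc_freq = Counter(t for doc in training_docs for t in set(doc))
    query = weight_vector(relevant_doc, doc_freq, n_docs)
    scores = [
        cosine_similarity(query, weight_vector(doc, doc_freq, n_docs))
        for doc in training_docs
    ]
    # Stretch the internal scores linearly to cover the unit interval [0, 1].
    lo, hi = min(scores), max(scores)
    span = hi - lo
    stretched = [(s - lo) / span if span else 1.0 for s in scores]
    ranked = sorted(enumerate(stretched), key=lambda pair: pair[1], reverse=True)
    # The relevancy threshold limits how many examples appear in the list.
    return [(i, s) for i, s in ranked if s >= threshold]


# Hypothetical toy corpus, for illustration only.
training = [
    "contract breach damages".split(),
    "football score goal".split(),
    "contract law court".split(),
]
relevant = "contract breach damages court".split()
best_matches = rank_training_examples(relevant, training, threshold=0.5)
```

In this sketch the first entry of `best_matches` identifies the training example that would be presented to the end user as the best match, and raising `threshold` shortens the list of additional examples.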